Chojecki Przemysław
import dalex as dx
import pandas as pd
import pickle
import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split
input_df = pd.read_csv('new_preprocessed_dataset.csv')
y = input_df.loc[:,'Attrition']
X = input_df.drop('Attrition', axis='columns')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)
path_xgb = 'xgb_model.p'
with open(path_xgb, "rb") as f:
    xgb_model = pickle.load(f)  # don't shadow the `import xgboost as xgb` alias
xgb_explainer = dx.Explainer(xgb_model, X_train, y_train, label='XGB')
Preparation of a new explainer is initiated
  -> data              : 7595 rows 21 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 7595 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : XGB
  -> predict function  : <function yhat_proba_default at 0x7fc5848d18b0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.49e-06, mean = 0.159, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.699, mean = 9.19e-05, max = 0.82
  -> model_info        : package xgboost

A new explainer has been created!
path_rf = 'random_forest_model.p'
with open(path_rf, "rb") as f:
    rf = pickle.load(f)
rf_explainer = dx.Explainer(rf, X_train, y_train, label='RF')
Preparation of a new explainer is initiated
  -> data              : 7595 rows 21 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 7595 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : RF
  -> predict function  : <function yhat_proba_default at 0x7fc5848d18b0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.159, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.37, mean = 0.000521, max = 0.44
  -> model_info        : package sklearn

A new explainer has been created!
path_l1 = 'l1_log_reg.p'
with open(path_l1, "rb") as f:
    l1 = pickle.load(f)
l1_explainer = dx.Explainer(l1, X_train, y_train, label='L1')
Preparation of a new explainer is initiated
  -> data              : 7595 rows 21 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 7595 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : L1
  -> predict function  : <function yhat_proba_default at 0x7fc5848d18b0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 4.17e-06, mean = 0.159, max = 0.984
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.879, mean = 0.000387, max = 0.999
  -> model_info        : package sklearn

A new explainer has been created!
/Users/dtgt/anaconda3/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.24.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
xgb_pdp = xgb_explainer.model_profile(random_state=14)
rf_pdp = rf_explainer.model_profile(random_state=14)
l1_pdp = l1_explainer.model_profile(random_state=14)
xgb_pdp.plot([rf_pdp, l1_pdp])
The values of the vast majority of columns have no effect on the prediction. Let's select the ones that do:
wybrane_kolumny = ['Total_Trans_Amt', 'Total_Revolving_Bal', 'Total_Ct_Chng_Q4_Q1', 'Total_Amt_Chng_Q4_Q1', 'Contacts_Count_12_mon']
xgb_pdp = xgb_explainer.model_profile(random_state=14, variables=wybrane_kolumny)
rf_pdp = rf_explainer.model_profile(random_state=14, variables=wybrane_kolumny)
l1_pdp = l1_explainer.model_profile(random_state=14, variables=wybrane_kolumny)
xgb_pdp.plot([rf_pdp, l1_pdp])
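Under the hood, a partial-dependence profile like the ones plotted above is just an average of ceteris-paribus predictions: fix one feature at a grid value for every observation, predict, and average. A hand-rolled numpy sketch of that idea, on a synthetic toy model (illustrative only; dalex's `model_profile` adds observation sampling, grid construction, and plotting on top):

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    # For each grid value v: force column `feature` to v for all rows
    # (ceteris paribus), predict, and average the predictions.
    profile = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        profile.append(predict(X_mod).mean())
    return np.array(profile)

rng = np.random.default_rng(14)
X = rng.normal(size=(500, 2))
# Toy "classifier" that depends only on the first feature.
predict = lambda X: 1 / (1 + np.exp(-2 * X[:, 0]))

grid = np.linspace(-2, 2, 9)
pdp_influential = partial_dependence(predict, X, 0, grid)  # rises along the grid
pdp_irrelevant = partial_dependence(predict, X, 1, grid)   # completely flat
```

A flat profile, as for `pdp_irrelevant` here, is exactly the signature of the "no effect on the prediction" columns we just filtered out.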

Total_Trans_Amt (the total transaction amount) and Total_Revolving_Bal (the total revolving balance carried over to the next billing period) are the most important columns for the models, and that is what we see here as well. Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 describe a decline in the customer's use of the bank's services; according to the models, this favors leaving the bank. The L1 model judged the effect of Total_Amt_Chng_Q4_Q1 too weak to outweigh the L1 penalty, which is why we observe that variable being dropped entirely. Contacts_Count_12_mon stands out strongly, for some reason, when it equals 6. This agrees with the conclusions from the previous PD plots.
xgb_ale = xgb_explainer.model_profile(type = 'accumulated', random_state=14)
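The variable-dropping behaviour of the L1 penalty can be reproduced on synthetic data: with a strong enough penalty, the coefficient of a barely informative feature is driven to exactly zero while a strongly informative one survives. A hedged sketch (synthetic data and hypothetical coefficient sizes, not the bank dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(14)
n = 2000
strong = rng.normal(size=n)   # feature with a large true effect
weak = rng.normal(size=n)     # feature with a tiny true effect
logits = 3.0 * strong + 0.05 * weak
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
X = np.column_stack([strong, weak])

# A strong L1 penalty (small C) prunes the weak coefficient to zero,
# analogous to how the L1 model dropped Total_Amt_Chng_Q4_Q1 above.
sparse_fit = LogisticRegression(penalty='l1', solver='liblinear', C=0.01)
sparse_fit.fit(X, y)
print(sparse_fit.coef_)
```

The resulting PDP of the pruned feature is then perfectly flat, which is what we see on the L1 curve for Total_Amt_Chng_Q4_Q1.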
rf_ale = rf_explainer.model_profile(type = 'accumulated', random_state=14)
l1_ale = l1_explainer.model_profile(type = 'accumulated', random_state=14)
xgb_ale.plot([rf_ale, l1_ale])
For these plots as well, let's examine only the 5 most relevant variables.
xgb_ale = xgb_explainer.model_profile(type = 'accumulated', random_state=14, variables=wybrane_kolumny)
rf_ale = rf_explainer.model_profile(type = 'accumulated', random_state=14, variables=wybrane_kolumny)
l1_ale = l1_explainer.model_profile(type = 'accumulated', random_state=14, variables=wybrane_kolumny)
xgb_ale.plot([rf_ale, l1_ale])

For some reason the columns ended up in a different order :(
Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 again turn out to have a large effect. Once again the plot shows that when the customer's card usage drops by a factor of two, the prediction tips in favor of churning.
rf_ale.result['_label_'] = "ALE_RF"  # label by the model actually plotted (RF)
rf_pdp.result['_label_'] = "PDP_RF"
rf_ale.plot(rf_pdp)

The plots for the random forest model are shifted but not distorted. This means the model does not pick up excessive interactions between the variables.
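That comparison relies on a known property: for a model without strong interactions, ALE and PDP curves differ mostly by a vertical shift, even when features are correlated. A minimal numpy sketch of a first-order ALE on a synthetic additive model with deliberately correlated features (illustrative only; dalex's `type='accumulated'` implementation handles grids, sampling, and categorical variables properly):

```python
import numpy as np

def ale_profile(predict, X, feature, n_bins=10):
    # First-order ALE: within each quantile bin, measure the local change
    # in prediction as the feature crosses the bin, then accumulate the
    # local effects and center them around zero.
    edges = np.quantile(X[:, feature], np.linspace(0, 1, n_bins + 1))
    local_effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (X[:, feature] >= lo) & (X[:, feature] <= hi)
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[:, feature], X_hi[:, feature] = lo, hi
        local_effects.append((predict(X_hi) - predict(X_lo)).mean())
    acc = np.cumsum(local_effects)
    return acc - acc.mean()

rng = np.random.default_rng(14)
x0 = rng.normal(size=400)
x1 = 0.8 * x0 + 0.2 * rng.normal(size=400)        # deliberately correlated
X = np.column_stack([x0, x1])
additive_model = lambda X: 2 * X[:, 0] + X[:, 1]  # no interactions

ale_x0 = ale_profile(additive_model, X, 0)
```

For this additive model the ALE curve is a straight line, and the corresponding PDP would be the same line plus a vertical offset, which is exactly the "shifted but not distorted" pattern observed for the random forest above. When a model does contain strong interactions, the two curve shapes diverge rather than merely shift.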